We use a generalization of the Lindeberg principle developed by Sourav Chatterjee to prove
universality properties for various problems in communications, statistical learning and random
matrix theory. We also show that these systems can be viewed as the limiting case of a properly
defined sparse system. The latter result is useful when the sparse systems are easier to analyze
than their dense counterparts. The list of problems we consider is by no means exhaustive. We
believe that the ideas can be used in many other problems relevant for information theory.

1 Introduction

The phenomenon of universality is common to many disciplines of science and engineering. A well
known example is the central limit theorem which, in a simple version, says the following. Let
$\{X_i\}_{i\ge 1}$ be a collection of i.i.d. random variables with mean zero and variance $\mathbb{E}[X_i^2] = 1$. Then

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \xrightarrow{d} \mathsf{N}(0,1),$$

where $\xrightarrow{d}$ denotes convergence in distribution as $n\to\infty$, and $\mathsf{N}(0,1)$ is a Gaussian random variable
with mean zero and variance one. In particular, the central limit theorem implies that the distribution
of $n^{-1/2}\sum_{i=1}^{n} X_i$ is asymptotically independent of the details of the distribution of the summands $X_i$.
In other words, its limit is "universal" for a large class of summands' distributions. Other examples
include the limiting spectrum of random matrices [2], and various properties of statistical mechanics
models [3].

Examples in communications theory where universal properties have been established include the
MIMO communications problem [4]. In these problems it was shown that the capacity of the system is
independent of the distribution of the fading coefficients and the spreading sequences, respectively.

A different research area in which universality ideas appear ubiquitous is compressed sensing. 
Donoho and Tanner [5] carried out a systematic empirical investigation of universality in this context. 
In particular they showed that the phase transition boundary in the sparsity-undersampling tradeoff 
is universal for a large class of sensing matrices. The precise location of this phase transition was 
determined earlier on in the case of Gaussian sensing matrices [6].

A related phenomenon which we study here is the sparse-dense equivalence. As an example, consider
a uniformly random regular graph $G_n$ of degree $d$ over $n$ vertices. Let $A_n \in \mathbb{R}^{n\times n}$ be the symmetric
matrix whose non-vanishing entries correspond to edges in $G_n$ and take values in $\{+1/\sqrt{d}, -1/\sqrt{d}\}$
independently and uniformly at random. As $n\to\infty$ the spectral measure of such a matrix converges
almost surely [7], [8] to a well defined limit $\rho_d(d\lambda)$ supported on $[-2\sqrt{1-1/d},\, 2\sqrt{1-1/d}]$, where

$$\rho_d(d\lambda) = \frac{d}{2\pi}\,\frac{\sqrt{4(1-1/d)-\lambda^2}}{d-\lambda^2}\, d\lambda. \qquad (1)$$

If we now consider the $d\to\infty$ limit, this distribution converges weakly to the celebrated semi-circle
law

$$\rho_\infty(d\lambda) = \frac{1}{2\pi}\sqrt{4-\lambda^2}\, d\lambda. \qquad (2)$$

This is the limiting spectrum of the standard (dense) Wigner matrices. We refer to this type of
property as a sparse-dense equivalence. Showing such a relationship can be particularly useful
when the analysis of the sparse system is easier than that of its dense counterpart. Specific examples will
be provided below.
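The convergence of the sparse limit law to the semi-circle law can be checked numerically. The following is a minimal sketch, assuming the standard Kesten-McKay density rescaled by $1/\sqrt{d}$ for the sparse law; the function names are illustrative and are not taken from the paper.

```python
import math


def kesten_mckay_density(x, d):
    """Rescaled Kesten-McKay density for degree d >= 3 (entries +-1/sqrt(d))."""
    edge_sq = 4.0 * (1.0 - 1.0 / d)  # squared edge of the support
    if x * x >= edge_sq:
        return 0.0
    return d * math.sqrt(edge_sq - x * x) / (2.0 * math.pi * (d - x * x))


def semicircle_density(x):
    """Wigner semi-circle density on [-2, 2]."""
    if x * x >= 4.0:
        return 0.0
    return math.sqrt(4.0 - x * x) / (2.0 * math.pi)


def integrate(density, lo, hi, steps=20000):
    """Midpoint rule; accurate enough here despite the square-root edges."""
    h = (hi - lo) / steps
    return h * sum(density(lo + (k + 0.5) * h) for k in range(steps))


if __name__ == "__main__":
    for d in (3, 10, 100, 10000):
        gap = max(abs(kesten_mckay_density(x / 10.0, d) - semicircle_density(x / 10.0))
                  for x in range(-20, 21))
        print(f"d={d:6d}  max pointwise gap on a grid = {gap:.5f}")
```

The printed gap shrinks as $d$ grows, which is the sparse-dense equivalence in this simplest spectral example.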

Universality and sparse-dense equivalence can have far reaching consequences in communications 
and information theory. In this paper, we demonstrate this by studying both phenomena within a 
common framework, and obtaining new results in each of the above mentioned problems. The main 
tool that we use is the following generalization of Lindeberg's principle that was proved in [1]. 

1.1 Lindeberg Principle 

Given $f : \mathbb{R}^n \to \mathbb{R}$, the generalized Lindeberg principle provides conditions under which the
distribution of $f(X_1,\dots,X_n)$ is approximately insensitive to the distribution of its arguments $X_1,\dots,X_n$,
which are assumed to be independent. This generalizes the classical Lindeberg proof of the central
limit theorem, which focused on $f(x_1,\dots,x_n) = (x_1+\dots+x_n)/\sqrt{n}$.
Let us restate here the main result of [1].

Theorem 1 (Generalized Lindeberg Principle, [1]). Let $U = (U_1,\dots,U_n)$ and $V = (V_1,\dots,V_n)$ be
two random vectors with mutually independent components. For $1\le i\le n$, define

$$a_i = \big|\mathbb{E}[U_i]-\mathbb{E}[V_i]\big|, \qquad b_i = \big|\mathbb{E}[U_i^2]-\mathbb{E}[V_i^2]\big|,$$

and further assume $\max_i\big(\mathbb{E}\{|U_i|^3\}+\mathbb{E}\{|V_i|^3\}\big) \le M_3$. Suppose $f:\mathbb{R}^n\to\mathbb{R}$ is a thrice continuously
differentiable function, and for $r = 1,2,3$, let $L_r(f)$ be a finite constant such that $|\partial_i^r f(u)| \le L_r(f)$
for each $i$ and $u\in\mathbb{R}^n$, where $\partial_i^r$ denotes the $r$-fold derivative in the $i$th coordinate. Then

$$\big|\mathbb{E}[f(U)]-\mathbb{E}[f(V)]\big| \le \sum_{i=1}^{n}\Big(a_i L_1(f) + \frac{1}{2} b_i L_2(f)\Big) + \frac{1}{6}\, n\, L_3(f)\, M_3.$$

Notice that, while this theorem explicitly bounds the change in the expectation of $f(\cdot)$, it gives control
on its distribution as well, by applying it to $g(f(\cdot))$, for $g:\mathbb{R}\to\mathbb{R}$ belonging to a suitable class of
test functions.
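As a hedged sanity check of Theorem 1 (an illustration, not part of [1]), take $f(x) = \cos((x_1+\dots+x_n)/\sqrt{n})$, $U_i$ uniform on $\{+1,-1\}$ and $V_i$ standard Gaussian. Both expectations have closed forms, $a_i = b_i = 0$, $L_3(f) = n^{-3/2}$ and $M_3 = 1 + 2\sqrt{2/\pi}$, so the bound reduces to $M_3/(6\sqrt{n})$. The function name below is an assumption made for this sketch.

```python
import math


def lindeberg_gap_and_bound(n):
    """Return (|E f(U) - E f(V)|, Theorem 1 bound) for f(x) = cos(sum(x) / sqrt(n))."""
    # U_i uniform on {+1, -1}: E cos(S / sqrt(n)) = cos(1 / sqrt(n)) ** n.
    expectation_rademacher = math.cos(1.0 / math.sqrt(n)) ** n
    # V_i standard Gaussian: sum(V) / sqrt(n) ~ N(0, 1) and E cos(N(0, 1)) = exp(-1/2).
    expectation_gaussian = math.exp(-0.5)
    gap = abs(expectation_rademacher - expectation_gaussian)

    # Means and variances match, so a_i = b_i = 0 and only the third-order term survives.
    third_derivative_bound = n ** -1.5  # L_3(f)
    third_moment_bound = 1.0 + 2.0 * math.sqrt(2.0 / math.pi)  # E|U|^3 + E|V|^3
    bound = n * third_derivative_bound * third_moment_bound / 6.0
    return gap, bound


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        gap, bound = lindeberg_gap_and_bound(n)
        print(f"n={n:5d}  gap={gap:.6f}  bound={bound:.6f}")
```

In this example the true gap decays faster than the bound, which is expected: the theorem is a worst-case estimate over all admissible $f$.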

In many problems of interest for this paper, the bound on the derivatives of $f$ required by the last
theorem does not hold, and a more careful analysis is needed. For that purpose we use the following
theorem. The proof is analogous to the one of Theorem 1 and is provided in Section 3.






Theorem 2. Let $U = (U_1,\dots,U_n)$ and $V = (V_1,\dots,V_n)$ be two random vectors with mutually
independent components. Let $\{a_i\}_{1\le i\le n}$ and $\{b_i\}_{1\le i\le n}$ be as defined in Theorem 1. Then

$$\big|\mathbb{E}[f(U)]-\mathbb{E}[f(V)]\big| \le \sum_{i=1}^{n}\Big\{ a_i\,\mathbb{E}\big[|\partial_i f(U_1^{i-1},0,V_{i+1}^{n})|\big] + \frac{1}{2}\, b_i\,\mathbb{E}\big[|\partial_i^2 f(U_1^{i-1},0,V_{i+1}^{n})|\big]$$
$$\qquad + \frac{1}{2}\,\mathbb{E}\Big|\int_0^{U_i} |\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})|\,(U_i-s)^2\,ds\Big| + \frac{1}{2}\,\mathbb{E}\Big|\int_0^{V_i} |\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})|\,(V_i-s)^2\,ds\Big| \Big\},$$

where $U_1^{i-1} = (U_1,\dots,U_{i-1})$ and $V_{i+1}^{n} = (V_{i+1},\dots,V_n)$.

2 Applications

In this section we discuss the application of Theorem 2 to a problem from communications theory
(code division multiple access channels), and one from statistical learning theory (estimation via
LASSO). We also revisit a standard model from statistical mechanics (the Sherrington-Kirkpatrick
model), and the spectrum of Wishart matrices, which is related to the capacity of MIMO channels. In
each of these cases, Theorem 2 implies both universality and sparse-dense equivalence results. We
will not try to be exhaustive, but rather point out some selected conclusions. This section contains
definitions and statements, while proofs are deferred to Section 4.

In the following, we use uppercase letters, e.g. $X$, $Y$, to denote random variables and their
lowercase counterparts, e.g. $x$, $y$, to denote realizations of such random variables. We also use boldface
characters to denote random matrices, with the subscript indicating their dimension, e.g. $\mathbf{A}_n$, $\mathbf{B}_n$.

Most of our results concern random matrices with i.i.d. entries and apply under some simple
centering and normalization conditions, provided the entries have finite sixth moment. Rather than
repeating these conditions at each of the results below, we introduce them once and for all.

Definition 1 (Random Matrices of Standard Type). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ be a sequence of
random matrices indexed by their dimensions $m$ and $n$ (with $m = m_n$ an appropriate sequence of
integers). We say that $\mathbf{A}_n$ is a random matrix of standard type if the entries $\{A_{ij}\}_{i,j\ge 1}$ form an
array of independent and identically distributed random variables with $\mathbb{E}[A_{ij}] = 0$, $\mathbb{E}[A_{ij}^2] = 1$ and
$\mathbb{E}[A_{ij}^6] \le K < \infty$, for some $K$ independent of $m$, $n$.

2.1 Capacity of a CDMA System

Code Division Multiple Access (CDMA) is a widely used communication system between multiple
users and a common receiver [9]. The scheme consists of $n$ users modulating their information
sequence by a signature sequence (spreading sequence) of length $m$ and transmitting the resulting
signal. The number $m$ is sometimes referred to as the spreading gain or the number of chips per
sequence. The receiver obtains the sum of all transmitted signals and the noise, which is often assumed
to be white and Gaussian (AWGN).

For the sake of simplicity, we will assume antipodal signals: each user $k$ wishes to communicate a
symbol $x_k \in \{+1,-1\}$ to the common receiver. User $k$ uses a signature sequence $(A_{1k},\dots,A_{mk})^{T}$
with $A_{ik}\in\mathbb{R}$. The received signal $Y_i$ in the $i$-th time interval is given by

$$Y_i = \sum_{k=1}^{n} A_{ik}\, x_k + \sigma Z_i,$$

where $Z_i$ are i.i.d. copies of $\mathsf{N}(0,1)$ and therefore the noise power is $\sigma^2$.

We use $x^n = (x_1,\dots,x_n)$ to denote any specific realization of the transmitted symbols, and will
assume that such a realization is chosen uniformly at random. The corresponding random
vector is $X^n = (X_1,\dots,X_n)$ while $Y = (Y_1,\dots,Y_m)$ is the received signal. Typically $X^n$ is chosen
to be uniformly distributed over $\{+1,-1\}^n$. In this paper we restrict to this case. However it is
possible to generalize the results below to a large class of distributions for the symbol $X_i$.

We write $A_n$ for the $m\times n$ matrix $\{A_{ik}\}_{1\le i\le m,\,1\le k\le n}$. Let $C_n(A_n)$ denote the capacity of such a
system, i.e. the number of bits per user that can be reliably transmitted to the common receiver
under the above constraints. Explicitly we have

$$C_n(A_n) = \log 2 - \frac{1}{2}\alpha - \frac{1}{n}\,\mathbb{E}_y \log\Big\{ \sum_{x\in\{+1,-1\}^n} e^{-\frac{1}{2\sigma^2}\|y - A_n x\|_2^2} \Big\}. \qquad (3)$$

Here the expectation $\mathbb{E}_y$ is taken over the received signal.

Random spreading sequences were initially considered in [10]. Here, the signature sequences are
modeled as random vectors with i.i.d. components $\{A_{ik}\}_{1\le i\le m,\,1\le k\le n}$. Without loss of generality we
can assume $\mathbb{E}\{A_{ik}\} = 0$ and $\mathbb{E}\{A_{ik}^2\} = 1$. We will be interested in the large system limit $m,n\to\infty$
with $\alpha = m/n$ fixed.

In order to keep the average power (per symbol) equal to 1, we will rescale the signature matrix by
a factor $1/\sqrt{m}$. For a random signature matrix $\mathbf{A}_n$, we therefore consider the capacity $C_n(m^{-1/2}\mathbf{A}_n)$,
which is itself random. As proved in [11], $C_n(m^{-1/2}\mathbf{A}_n)$ does in fact concentrate exponentially
around its expectation. This motivates us to focus on its expectation.

Theorem 3 (Universality of the Capacity of random CDMA systems). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$
and $\mathbf{B}_n = \{B_{ij}\}_{1\le i\le m,\,1\le j\le n}$ denote two $m\times n$ dimensional random spreading matrices of standard
type. Then

$$\lim_{n\to\infty,\ m=n\alpha} \big\{\mathbb{E}[C_n(m^{-1/2}\mathbf{A}_n)] - \mathbb{E}[C_n(m^{-1/2}\mathbf{B}_n)]\big\} = 0.$$

The above theorem establishes that the per-user capacity of a CDMA channel is asymptotically
independent of the distribution of the spreading sequences. The conditions required to be satisfied
by the distributions are milder than the ones imposed in [11].

Our next result concerns the sparse-dense equivalence. Sparse signature schemes were proposed in
[12] both as a tool for simplifying mathematical analysis and as a design option with potential
practical advantages. Given a signature matrix $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ defined as above, its sparsification
$\mathbf{A}_n^{\gamma}$ is given by

$$A^{\gamma}_{ij} = \begin{cases} A_{ij} & \text{with probability } \gamma/n,\\ 0 & \text{with probability } 1-\gamma/n,\end{cases}$$

with $\gamma > 0$ a design parameter that is kept fixed in the large system limit. Under a sparse signature
scheme, the power per symbol is normalized to 1 if we rescale the signatures by a factor $1/\sqrt{\gamma}$. The
channel output is therefore

$$Y_i = \frac{1}{\sqrt{\gamma}} \sum_{k=1}^{n} A^{\gamma}_{ik}\, x_k + \sigma Z_i. \qquad (4)$$
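The sparsification step can be sketched in a few lines. This is a minimal illustration, assuming the dense matrix is stored as a list of rows; the helper names `sparsify` and `rescale` are hypothetical and do not come from the paper.

```python
import random


def sparsify(matrix, gamma, rng):
    """Keep each entry independently with probability gamma / n, else set it to 0."""
    n = len(matrix[0])  # number of users (columns)
    keep_probability = gamma / n
    return [[entry if rng.random() < keep_probability else 0.0 for entry in row]
            for row in matrix]


def rescale(matrix, factor):
    """Multiply every entry by a common factor, e.g. 1 / sqrt(gamma)."""
    return [[factor * entry for entry in row] for row in matrix]


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, gamma = 100, 200, 5.0
    dense = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]
    sparse = rescale(sparsify(dense, gamma, rng), gamma ** -0.5)
    nonzero = sum(1 for row in sparse for entry in row if entry != 0.0)
    print(f"nonzero entries: {nonzero} (expected about {m * gamma:.0f})")
```

Each row keeps about $\gamma$ non-zero entries on average, independently of $n$, which is what makes the sparse system amenable to local (tree-like) analysis.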

We can then prove the following sparse-dense equivalence result. 
Theorem 4 (Sparse-Dense Equivalence for CDMA channels). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ and
$\mathbf{B}_n = \{B_{ij}\}_{1\le i\le m,\,1\le j\le n}$ denote two $m\times n$ dimensional random spreading matrices of standard type.
For $\gamma > 0$, let $\mathbf{A}_n^{\gamma}$ be the sparsification of $\mathbf{A}_n$. Then

$$\lim_{\gamma\to\infty}\ \lim_{n\to\infty,\ m=n\alpha} \big\{\mathbb{E}[C_n(\gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[C_n(m^{-1/2}\mathbf{B}_n)]\big\} = 0.$$

As already mentioned, establishing sparse-dense equivalence is particularly useful when the
analysis of a sparse system is simpler than for its dense counterpart. In [12] it was shown that there
exists $\alpha_s > 0$ such that, for all $\alpha < \alpha_s$,

$$\lim_{\gamma\to\infty}\ \lim_{n\to\infty,\ m=n\alpha} \mathbb{E}[C_n(\gamma^{-1/2}\mathbf{A}_n^{\gamma})] = \min_{q\in[0,1]} C_{RS}(q), \qquad (5)$$

where

$$C_{RS}(q) = \frac{\lambda}{2}(1+q) - \frac{1}{2\alpha}\log(\lambda\sigma^2) - \mathbb{E}_Z\big\{\log\big(2\cosh(\sqrt{\lambda}\,Z + \lambda)\big)\big\}, \qquad (6)$$

$$\lambda = \frac{1}{\sigma^2 + \alpha(1-q)}, \qquad (7)$$

where $\mathbb{E}_Z$ denotes expectation with respect to $Z\sim\mathsf{N}(0,1)$. The parameter $\alpha_s$ is defined as the largest
$\alpha$ such that the optimizer in (5) is unique. Numerically $\alpha_s \approx 1.49$. The same formula was derived
earlier by Tanaka [13] using the non-rigorous replica method from statistical physics.

Combining this with Theorem 4 we can conclude the following result for the capacity of a random
CDMA system.

Corollary 1 (Capacity of random CDMA systems). Let $\mathbf{A}_n$ denote an $m\times n$ dimensional random
spreading matrix with i.i.d. entries. Assume $\mathbb{E}[A_{ij}] = 0$, $\mathbb{E}[A_{ij}^2] = 1$ and $\mathbb{E}[A_{ij}^6] \le K < \infty$. Then for
$\alpha < \alpha_s$

$$\lim_{n\to\infty,\ m=n\alpha} \mathbb{E}[C_n(m^{-1/2}\mathbf{A}_n)] = \min_{q\in[0,1]} C_{RS}(q).$$



2.2 Estimation via LASSO 

The LASSO (also known as basis pursuit de-noising) is a popular strategy in statistical learning,
used for reconstructing high-dimensional parameter vectors from noisy measurements [14], [15]. It is
particularly well suited when the underlying parameter vector is sparse in an appropriate basis. For
this very reason, it is the object of intense study within the compressed sensing literature.

We assume here that a signal $x_0\in\mathbb{R}^n$ is observed through the sensing matrix $A_n$, which has
dimensions $m\times n$. The measurements $y\in\mathbb{R}^m$ are modeled as noisy linear functions

$$y = A_n x_0 + z, \qquad (8)$$

with $z\in\mathbb{R}^m$ a noise vector, which we take to have i.i.d. Gaussian entries. The recovery of $x_0$
from $y$ is done using the following convex optimization problem

$$\hat{x}(\lambda) = \arg\min_{x\in\mathbb{R}^n}\Big\{ \frac{1}{2}\|y - A_n x\|_2^2 + \lambda \|x\|_1 \Big\}. \qquad (9)$$
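For small instances the optimization problem (9) can be solved by cyclic coordinate descent with soft thresholding. The sketch below is one standard solver chosen for illustration; it is not the method analyzed in the paper, and the function names are assumptions.

```python
def soft_threshold(value, threshold):
    """Proximal map of threshold * |.|: shrink value towards zero by threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(a, y, lam, sweeps=100):
    """Minimize 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by cyclic coordinate descent."""
    m, n = len(a), len(a[0])
    x = [0.0] * n
    residual = list(y)  # equals y - A x for the current x (x = 0 initially)
    column_sq = [sum(a[i][j] ** 2 for i in range(m)) for j in range(n)]
    for _ in range(sweeps):
        for j in range(n):
            if column_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores coordinate j.
            rho = sum(a[i][j] * residual[i] for i in range(m)) + column_sq[j] * x[j]
            new_value = soft_threshold(rho, lam) / column_sq[j]
            delta = new_value - x[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= a[i][j] * delta
                x[j] = new_value
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(lasso_coordinate_descent(identity, [3.0, -0.5, 1.5], lam=1.0))
```

With an identity sensing matrix the minimizer is the soft-thresholded measurement vector, which gives a quick correctness check for the solver.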

For some applications the sensing matrix $A_n$ is not far from random or pseudo-random. It is
important to ask to which degree results obtained for a specific distribution of $A_n$ generalize to other
distributions [5]. We consider the case in which the entries $x_{0,i}$ of $x_0$ are uniformly bounded, i.e.,
$|x_{0,i}| \le x_{\max}$ for some constant $x_{\max} > 0$ independent of $n$, $m$. We further assume that the noise
vector $z$ has i.i.d. entries $Z_i\sim\mathsf{N}(0,\sigma^2)$ and focus on the limit $m,n\to\infty$ with $m/n = \alpha$ fixed.

The next result provides rigorous evidence towards the broader universality picture, by proving
universality for the normalized cost

$$L(A_n) = \frac{1}{n} \min_{x\in[-x_{\max},\,x_{\max}]^n} \Big\{ \frac{1}{2}\|y - A_n x\|_2^2 + \lambda\|x\|_1 \Big\}. \qquad (10)$$

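Theorem 5 (Universality for LASSO). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ and $\mathbf{B}_n = \{B_{ij}\}_{1\le i\le m,\,1\le j\le n}$
denote two $m\times n$ dimensional random sensing matrices of standard type. Then

$$\lim_{n\to\infty} \big|\mathbb{E}[L(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L(n^{-1/2}\mathbf{B}_n)]\big| = 0.$$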

2.3 Spectrum of Wishart matrices and capacity of MIMO channels 

Given an $n\times n$ symmetric matrix $W_n$, let $\{\lambda_i(W_n)\}_{1\le i\le n}$ denote its eigenvalues. The spectral measure
of $W_n$ is the probability measure

$$\mu_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{\lambda_i(W_n)}. \qquad (11)$$

The study of the limit of $\mu_n$ as $n\to\infty$, for a sequence of random matrices $\mathbf{W}_n$, is a central topic
in random matrix theory, with important applications in multi-antenna communications. A
well-studied example is the family of Wishart matrices. Here, $\mathbf{W}_n = \frac{1}{n}\mathbf{A}_n^{T}\mathbf{A}_n$, where $\mathbf{A}_n$ is an $m\times n$
matrix, whose entries are i.i.d. realizations of a zero mean random variable with variance 1.

A standard approach to characterizing the spectral measure is through its Stieltjes transform [16],
which is defined as

$$S_n(W_n, z) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\lambda_i(W_n) - z} = \frac{1}{n}\mathrm{Tr}\big\{(W_n - z I_n)^{-1}\big\},$$

where $z\in\mathbb{C}\setminus\mathbb{R}$ and $I_n$ is the $n$-dimensional identity matrix. The limiting spectrum of the family
$\{\mathbf{W}_n\}_{n\ge 1}$ can be obtained by computing $\lim_{n\to\infty} S_n(\mathbf{W}_n, z)$. The universality of Wishart matrices
is a well known result [5]. The following is a sparse-dense equivalence result for this class of matrices.

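Theorem 6 (Sparse-Dense Equivalence for Wishart Matrices). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ and
$\mathbf{B}_n = \{B_{ij}\}_{1\le i\le m,\,1\le j\le n}$ denote two $m\times n$ dimensional random matrices of standard type. For
$\gamma > 0$, let $\mathbf{A}_n^{\gamma}$ be the sparsification of $\mathbf{A}_n$. Let $\mathbf{W}^{\gamma}_{A,n} = \gamma^{-1}(\mathbf{A}_n^{\gamma})^{T}\mathbf{A}_n^{\gamma}$ and $\mathbf{W}_{B,n} = n^{-1}\mathbf{B}_n^{T}\mathbf{B}_n$.

Then for all $z\in\mathbb{C}\setminus\mathbb{R}$

$$\lim_{\gamma\to\infty}\ \lim_{n\to\infty} \big\{\mathbb{E}[S_n(\mathbf{W}^{\gamma}_{A,n}, z)] - \mathbb{E}[S_n(\mathbf{W}_{B,n}, z)]\big\} = 0.$$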
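For concreteness, the Stieltjes transform of an empirical spectral measure can be evaluated directly from a list of eigenvalues. The sketch below assumes the common convention $\frac{1}{n}\sum_i 1/(\lambda_i - z)$; the function name is illustrative.

```python
def stieltjes_transform(eigenvalues, z):
    """Average of 1 / (lambda_i - z) over the eigenvalues, for complex z off the real line."""
    return sum(1.0 / (eigenvalue - z) for eigenvalue in eigenvalues) / len(eigenvalues)


if __name__ == "__main__":
    # Two eigenvalues, evaluated at z = i.
    print(stieltjes_transform([1.0, 3.0], 1j))
```

For $\mathrm{Im}(z) > 0$ the result always has positive imaginary part, and its modulus is at most $1/|\mathrm{Im}(z)|$, which is the boundedness used repeatedly in the proofs.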

Under appropriate tightness conditions, convergence of Stieltjes transforms implies weak
convergence of the spectrum, which further implies the convergence of the empirical average $\frac{1}{n}\sum_{i=1}^{n} f(\lambda_i)$
for any continuous bounded function $f$. As a particular application of this remark, we consider the
capacity of multi-input multi-output (MIMO) communication systems. The channel model is very
similar to the CDMA system discussed in Section 2.1. For a channel input $X = (X_1,\dots,X_n)$, the
channel output is a vector $Y = (Y_1,\dots,Y_m)$ in $\mathbb{R}^m$, with components

$$Y_i = \sum_{k=1}^{n} H_{ik} X_k + \sigma Z_i,$$

where $Z_i$ are i.i.d. realizations of $\mathsf{N}(0,1)$. However, in this case it is customary not to restrict the
inputs to be in $\{+1,-1\}$, but rather to impose a power constraint $\sum_{i=1}^{n}\mathbb{E}\{X_i^2\} \le n$. Given a
channel gains matrix $H_n = \{H_{ij}\}_{1\le i\le m,\,1\le j\le n}$, the average capacity per input antenna [1] is then
given by

$$C_n(H_n) = \max_{Q}\Big\{ \frac{1}{n}\log\mathrm{Det}\Big(I_m + \frac{1}{\sigma^2} H_n Q H_n^{T}\Big) \Big\},$$

where $Q$ is the input covariance. For the case of $H_{ij}$ being i.i.d. symmetric Gaussian random variables
it was shown in [2] that the above maximum is achieved for $Q = I_n$. Here, we assume that little is
known about the channel gains and therefore this covariance matrix is used for other matrices $H_n$
as well. Under this assumption, the achievable average rate is given by

$$C_n(H_n) = \frac{1}{n}\sum_{i=1}^{m}\log\Big(1 + \frac{1}{\sigma^2}\lambda_i(H_n H_n^{T})\Big) = \frac{1}{n}\sum_{i=1}^{n}\log\Big(1 + \frac{1}{\sigma^2}\lambda_i(H_n^{T} H_n)\Big).$$

The above theorem implies the following result for MIMO channels.

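Corollary 2 (Sparse-Dense Equivalence for the MIMO Capacity). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ and
$\mathbf{B}_n = \{B_{ij}\}_{1\le i\le m,\,1\le j\le n}$ denote two $m\times n$ dimensional random matrices of standard type. For $\gamma > 0$,
let $\mathbf{A}_n^{\gamma}$ be the sparsification of $\mathbf{A}_n$. Then

$$\lim_{\gamma\to\infty}\ \lim_{n\to\infty,\ m=n\alpha} \big\{\mathbb{E}[C_n(\gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[C_n(n^{-1/2}\mathbf{B}_n)]\big\} = 0.$$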



2.4 Spin glass models 

Spin glass models have been the object of intense interest within statistical mechanics, mathematical
physics and probability theory. Both rigorous and heuristic techniques from this domain have been
applied with success in information theory [17].

A number of universality and sparse-dense equivalence results have been proved in this context
[1], [18], [19]. We re-derive two of these results here because they provide a very simple and instructive
illustration of the proof technique that is used in the more intricate examples listed in the previous
sections.

We focus in particular on the Sherrington-Kirkpatrick (SK) model. The model is defined by the
Hamiltonian function $\mathcal{H}: \{+1,-1\}^n\times\mathbb{R}^{n\times n}\to\mathbb{R}$ given by

$$\mathcal{H}(x, A_n) = -\frac{1}{\sqrt{2}}\sum_{i,j=1}^{n} A_{ij}\, x_i x_j,$$

for an $n\times n$ dimensional matrix $A_n$ and $x = (x_1,\dots,x_n)\in\{+1,-1\}^n$. An important object of
interest in this context is the free entropy density at inverse temperature $\beta$, which is defined by

$$f(\beta, A_n) = \frac{1}{n}\log\Big\{\sum_{x\in\{+1,-1\}^n} e^{-\beta\mathcal{H}(x, A_n)}\Big\}.$$
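For very small $n$ the free entropy density can be computed by exhaustive enumeration, which also allows a finite-difference check of its derivative with respect to a single matrix entry. The sketch below assumes the sign and $1/\sqrt{2}$ normalization $\mathcal{H}(x,A) = -\frac{1}{\sqrt 2}\sum_{i,j}A_{ij}x_ix_j$; under that assumption $\partial f/\partial A_{rc} = \beta\langle x_r x_c\rangle/(\sqrt{2}\,n)$. All function names are illustrative.

```python
import itertools
import math


def hamiltonian(x, a):
    """SK Hamiltonian with the assumed normalization -(1/sqrt(2)) sum_ij A_ij x_i x_j."""
    n = len(x)
    total = sum(a[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -total / math.sqrt(2.0)


def gibbs_weights(beta, a):
    """All 2^n configurations together with their unnormalized Gibbs weights."""
    configurations = list(itertools.product((1, -1), repeat=len(a)))
    weights = [math.exp(-beta * hamiltonian(x, a)) for x in configurations]
    return configurations, weights


def free_entropy(beta, a):
    """f(beta, A) = (1/n) log sum_x exp(-beta H(x, A)) by brute force."""
    _, weights = gibbs_weights(beta, a)
    return math.log(sum(weights)) / len(a)


def gibbs_correlation(beta, a, r, c):
    """Gibbs average <x_r x_c>."""
    configurations, weights = gibbs_weights(beta, a)
    numerator = sum(w * x[r] * x[c] for x, w in zip(configurations, weights))
    return numerator / sum(weights)


if __name__ == "__main__":
    a = [[0.0, 0.4, -0.7], [0.2, 0.0, 1.1], [-0.5, 0.3, 0.0]]
    print(free_entropy(0.8, a), gibbs_correlation(0.8, a, 0, 2))
```

The enumeration costs $2^n$ evaluations, so this is only a checking tool; the point of the Lindeberg argument is precisely to control such derivatives without computing them.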
Universality of the free energy for the SK model was established in [20] and was later extended
to general distributions in [21]. As shown in [1], the current approach gives a stronger result.

Theorem 7 (Universality for the SK model [1]). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i,j\le n}$ and $\mathbf{B}_n = \{B_{ij}\}_{1\le i,j\le n}$ be
two $n\times n$ dimensional random matrices. Assume that both $\{A_{ij}\}$ and $\{B_{ij}\}$ are collections of i.i.d.
random variables with $\mathbb{E}[A_{ij}] = \mathbb{E}[B_{ij}] = 0$, $\mathbb{E}[A_{ij}^2] = \mathbb{E}[B_{ij}^2] = 1$, and $\mathbb{E}[|A_{ij}|^3], \mathbb{E}[|B_{ij}|^3] \le K < \infty$.
Then

$$\lim_{n\to\infty} \big\{\mathbb{E}[f(\beta, n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[f(\beta, n^{-1/2}\mathbf{B}_n)]\big\} = 0.$$

The sparse-dense equivalence was proved in [19] under the slightly stronger assumption of
uniformly bounded entries $|A_{ij}| \le 1$ with even distribution.

Theorem 8 (Sparse-Dense Equivalence). Let $\mathbf{A}_n = \{A_{ij}\}_{1\le i,j\le n}$ and $\mathbf{B}_n = \{B_{ij}\}_{1\le i,j\le n}$ be two $n\times n$
dimensional random matrices. Assume that both $\{A_{ij}\}$ and $\{B_{ij}\}$ are collections of i.i.d. random
variables with $\mathbb{E}[A_{ij}] = \mathbb{E}[B_{ij}] = 0$, $\mathbb{E}[A_{ij}^2] = \mathbb{E}[B_{ij}^2] = 1$, and $\mathbb{E}[|A_{ij}|^3], \mathbb{E}[|B_{ij}|^3] \le K < \infty$. For
$\gamma > 0$, let $\mathbf{A}_n^{\gamma}$ be the sparsification of $\mathbf{A}_n$. Then

$$\lim_{\gamma\to\infty}\ \lim_{n\to\infty} \big\{\mathbb{E}[f(\beta, \gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[f(\beta, n^{-1/2}\mathbf{B}_n)]\big\} = 0.$$

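3 Proof of Theorem 2

Proof of Theorem 2. Let $\partial_i^r f$ denote the $r$-fold derivative of $f$ in the $i$-th coordinate. Let

$$W_i = (U_1,\dots,U_i,V_{i+1},\dots,V_n), \qquad W_i^0 = (U_1,\dots,U_{i-1},0,V_{i+1},\dots,V_n).$$

Then

$$\mathbb{E}[f(U)] - \mathbb{E}[f(V)] = \sum_{i=1}^{n}\big(\mathbb{E}[f(W_i)] - \mathbb{E}[f(W_{i-1})]\big). \qquad (12)$$

From the third-order Taylor expansion, we have

$$f(W_i) = f(W_i^0) + U_i\,\partial_i f(W_i^0) + \frac{1}{2}U_i^2\,\partial_i^2 f(W_i^0) + \frac{1}{2}\int_0^{U_i}\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})\,(U_i-s)^2\,ds. \qquad (13)$$

Similarly, we get

$$f(W_{i-1}) = f(W_i^0) + V_i\,\partial_i f(W_i^0) + \frac{1}{2}V_i^2\,\partial_i^2 f(W_i^0) + \frac{1}{2}\int_0^{V_i}\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})\,(V_i-s)^2\,ds. \qquad (14)$$

From Eq. (12), using (13) and (14), we get

$$\mathbb{E}[f(U)] - \mathbb{E}[f(V)] = \sum_{i=1}^{n}\Big\{\mathbb{E}[(U_i-V_i)\,\partial_i f(W_i^0)] + \frac{1}{2}\mathbb{E}[(U_i^2-V_i^2)\,\partial_i^2 f(W_i^0)]$$
$$\qquad + \mathbb{E}\Big[\frac{1}{2}\int_0^{U_i}\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})\,(U_i-s)^2\,ds\Big] - \mathbb{E}\Big[\frac{1}{2}\int_0^{V_i}\partial_i^3 f(U_1^{i-1},s,V_{i+1}^{n})\,(V_i-s)^2\,ds\Big]\Big\}.$$

The result follows by noting that $W_i^0$ is independent of $\{U_i, V_i\}$ and taking absolute values term by term. □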
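The Taylor expansion with integral remainder used in (13) and (14) can be checked numerically in one dimension. The sketch below uses Simpson's rule and $f = \sin$ as a test case; both choices, and the function names, are assumptions made for illustration.

```python
import math


def simpson(func, lo, hi, intervals=2000):
    """Composite Simpson rule on [lo, hi]; works for hi < lo as an oriented integral."""
    h = (hi - lo) / intervals
    total = func(lo) + func(hi)
    for k in range(1, intervals):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3.0


def taylor_with_integral_remainder(f0, f1, f2, third_derivative, u):
    """f(0) + u f'(0) + u^2 f''(0) / 2 + (1/2) * integral_0^u f'''(s) (u - s)^2 ds."""
    remainder = 0.5 * simpson(lambda s: third_derivative(s) * (u - s) ** 2, 0.0, u)
    return f0 + u * f1 + 0.5 * u * u * f2 + remainder


if __name__ == "__main__":
    for u in (1.3, -0.7):
        value = taylor_with_integral_remainder(0.0, 1.0, 0.0, lambda s: -math.cos(s), u)
        print(u, value, math.sin(u))
```

The identity is exact, so any discrepancy comes from the quadrature only; in the proof the same formula is applied in the $i$-th coordinate with all other coordinates frozen.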
4 Proofs of statements from Section 2

We will present the proofs starting from the last example, i.e. the Sherrington-Kirkpatrick model in
Section 2.4. As mentioned, this is a particularly simple example of the general proof strategy.



4.1 SK Model 

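As mentioned in Section 2.4, the Hamiltonian for this model is given by

$$\mathcal{H}(x, A_n) = -\frac{1}{\sqrt{2}}\sum_{i,j=1}^{n} A_{ij}\, x_i x_j,$$

where $A_n$ is an $n\times n$ dimensional matrix. For a function $(x, A_n)\mapsto g(x, A_n)$, we denote by $\langle g(x, A_n)\rangle$
its expectation with respect to the probability distribution $p_{\beta,A_n}(x)\propto\exp\{-\beta\mathcal{H}(x, A_n)\}$ on $\{+1,-1\}^n$.
Explicitly:

$$\langle g(x, A_n)\rangle = \frac{\sum_{x\in\{+1,-1\}^n} g(x, A_n)\, e^{-\beta\mathcal{H}(x, A_n)}}{\sum_{x\in\{+1,-1\}^n} e^{-\beta\mathcal{H}(x, A_n)}}. \qquad (15)$$

Denote by $\partial^k_{rc}$ the $k$-th partial derivative with respect to $A_{rc}$ (row $r$, column $c$). A straightforward
calculation shows that the third derivative $\partial^3_{rc} f(\beta, A_n)$ is given by

$$\partial^3_{rc} f(\beta, A_n) = -\frac{\beta^3}{\sqrt{2}\, n}\big(\langle x_r x_c\rangle - \langle x_r x_c\rangle^3\big),$$

which implies $L_3(f) \le \beta^3/(\sqrt{2}\, n)$ (with $L_3$ defined as in Theorem 1).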

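Proof of Theorem 7. From the definition of the random matrices $\mathbf{A}_n$ and $\mathbf{B}_n$, we have
$\mathbb{E}[A_{ij}] = \mathbb{E}[B_{ij}]$, $\mathbb{E}[A_{ij}^2] = \mathbb{E}[B_{ij}^2]$, $\mathbb{E}[|A_{ij}|^3] \le (1+K)$ and $\mathbb{E}[|B_{ij}|^3] \le (1+K)$. Using Theorem 1 with
the $n^2$ independent entries as arguments, we get

$$\big|\mathbb{E}[f(\beta, n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[f(\beta, n^{-1/2}\mathbf{B}_n)]\big| \le \frac{1}{6}\, n^2\, \frac{\beta^3}{\sqrt{2}\, n}\Big(\mathbb{E}\Big[\frac{|A_{ij}|^3}{n^{3/2}}\Big] + \mathbb{E}\Big[\frac{|B_{ij}|^3}{n^{3/2}}\Big]\Big) \le \frac{(1+K)\,\beta^3}{3\sqrt{2}}\,\frac{1}{\sqrt{n}},$$

which vanishes as $n\to\infty$. □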



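Proof of Theorem 8. From the definition of the random matrices $\mathbf{A}_n^{\gamma}$ and $\mathbf{B}_n$, we have
$\mathbb{E}[A^{\gamma}_{ij}] = \mathbb{E}[B_{ij}] = 0$, $\mathbb{E}[(\gamma^{-1/2}A^{\gamma}_{ij})^2] = \mathbb{E}[(n^{-1/2}B_{ij})^2] = 1/n$, $\mathbb{E}[|A^{\gamma}_{ij}|^3] \le (1+K)\gamma/n$ and
$\mathbb{E}[|B_{ij}|^3] \le (1+K)$ (with $K$ independent of $\gamma$ and $n$). Therefore, using the estimate on $L_3(f)$ from the
previous proof, together with Theorem 1, we have

$$\big|\mathbb{E}[f(\beta, \gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[f(\beta, n^{-1/2}\mathbf{B}_n)]\big| \le \frac{1}{6}\, n^2\, \frac{\beta^3}{\sqrt{2}\, n}\Big(\mathbb{E}\Big[\frac{|A^{\gamma}_{ij}|^3}{\gamma^{3/2}}\Big] + \mathbb{E}\Big[\frac{|B_{ij}|^3}{n^{3/2}}\Big]\Big) \le K'\beta^3\max\Big\{\frac{1}{\sqrt{\gamma}}, \frac{1}{\sqrt{n}}\Big\}.$$

Therefore, $\lim_{\gamma\to\infty}\lim_{n\to\infty}\big\{\mathbb{E}[f(\beta, \gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[f(\beta, n^{-1/2}\mathbf{B}_n)]\big\} = 0$. □

4.2 CDMA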

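For any $m\times n$ matrix $A_n$, the capacity (3) can be expressed as

$$C_n(A_n) = \log 2 - \frac{1}{2}\alpha - \frac{1}{n}\,\frac{1}{2^n}\sum_{x^n\in\{+1,-1\}^n}\mathbb{E}_Z\log\Big\{\sum_{x'\in\{+1,-1\}^n} e^{-\frac{1}{2\sigma^2}\|Z + A_n(x^n - x')\|_2^2}\Big\},$$

where $Z$ is an $m$-dimensional random vector, whose entries are i.i.d. $\mathsf{N}(0,\sigma^2)$. By a simple change
of variables in the inner sum, namely $x_j = 1 - x^n_j x'_j \in\{0,2\}$, we get

$$C_n(A_n) = \log 2 - \frac{1}{2}\alpha - \frac{1}{n}\,\frac{1}{2^n}\sum_{x^n\in\{+1,-1\}^n}\mathbb{E}_Z\log\Big\{\sum_{x\in\{0,2\}^n} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}\big(Z_i + \sum_{j=1}^{n} A_{ij}x^n_j x_j\big)^2}\Big\}.$$

For a matrix $A_n = \{A_{ij}\}_{1\le i\le m,\,1\le j\le n}$ and a vector $x^n\in\{+1,-1\}^n$, define $A_n(x^n)$ by letting
$A_n(x^n)_{ij} = A_{ij}\,x^n_j$. Further, define the Hamiltonian function $\mathcal{H}:\{0,2\}^n\times\mathbb{R}^m\times\mathbb{R}^{m\times n}\to\mathbb{R}$ by

$$\mathcal{H}(x, Z, A_n) = \frac{1}{2\sigma^2}\|Z + A_n x\|_2^2 = \frac{1}{2\sigma^2}\sum_{i=1}^{m}\Big(Z_i + \sum_{j=1}^{n} A_{ij}x_j\Big)^2.$$

Then we have

$$C_n(A_n) = \log 2 - \frac{1}{2}\alpha - \frac{1}{2^n}\sum_{x^n\in\{+1,-1\}^n}\mathbb{E}_Z f(A_n(x^n), Z), \qquad f(A_n, Z) \equiv \frac{1}{n}\log\Big\{\sum_{x\in\{0,2\}^n} e^{-\mathcal{H}(x, Z, A_n)}\Big\}.$$

If $\mathbf{A}_n$ is a random matrix of standard type, and $x^n\in\{+1,-1\}^n$, then $\mathbf{A}_n(x^n)$ is also a random
matrix of standard type. In order to prove the universality results, Theorems 3 and 4, it is therefore
sufficient to fix, say, $x^n = (+1,\dots,+1)$, and prove universality of $\mathbb{E}_Z f(\mathbf{A}_n, Z)$.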
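The change of variables from $x'\in\{+1,-1\}^n$ to $x\in\{0,2\}^n$ can be verified exhaustively for a small instance. The sketch below checks that $Z + A_n(x^n - x')$ and $Z + A_n(x^n)\,x$ coincide entry by entry when $x_j = 1 - x^n_j x'_j$; the function names and the test matrix are illustrative.

```python
import itertools


def residual_direct(z, a, sent, candidate):
    """Z + A (x^n - x') with sent = x^n and candidate = x', both in {+1, -1}^n."""
    n = len(sent)
    return [z_i + sum(row[j] * (sent[j] - candidate[j]) for j in range(n))
            for z_i, row in zip(z, a)]


def residual_shifted(z, a, sent, shifted):
    """Z + A(x^n) x with A(x^n)_ij = A_ij x^n_j and shifted = x in {0, 2}^n."""
    n = len(sent)
    return [z_i + sum(row[j] * sent[j] * shifted[j] for j in range(n))
            for z_i, row in zip(z, a)]


def shift(sent, candidate):
    """Change of variables x_j = 1 - x^n_j x'_j, mapping {+1, -1}^n onto {0, 2}^n."""
    return tuple(1 - s * c for s, c in zip(sent, candidate))


if __name__ == "__main__":
    a = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
    z = [0.3, -1.2]
    sent, candidate = (1, -1, 1), (-1, -1, 1)
    print(residual_direct(z, a, sent, candidate))
    print(residual_shifted(z, a, sent, shift(sent, candidate)))
```

Because the map is a bijection for every fixed $x^n$, the inner sum over $x'$ can be replaced by a sum over $x\in\{0,2\}^n$ without changing its value.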

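Analogously to the proof in the previous section, for a function $(x, Z, A_n)\mapsto g(x, Z, A_n)$, we let

$$\langle g(x, Z, A_n)\rangle \equiv \frac{\sum_{x\in\{0,2\}^n} g(x, Z, A_n)\, e^{-\mathcal{H}(x, Z, A_n)}}{\sum_{x\in\{0,2\}^n} e^{-\mathcal{H}(x, Z, A_n)}}. \qquad (16)$$

In order to use Theorem 2 we need to estimate the third derivatives of $f$. Again, $\partial^k_{rc} f$ denotes the $k$-th
derivative of $f$ with respect to $A_{rc}$. Since $\mathcal{H}$ is quadratic in $A_{rc}$, we have $\partial^3_{rc}\mathcal{H} = 0$, and the third
derivative is then given by

$$\partial^3_{rc} f(A_n, Z) = \frac{1}{n}\Big( -\langle(\partial_{rc}\mathcal{H})^3\rangle + 3\langle\partial_{rc}\mathcal{H}\rangle\langle(\partial_{rc}\mathcal{H})^2\rangle - 2\langle\partial_{rc}\mathcal{H}\rangle^3 + 3\langle\partial_{rc}\mathcal{H}\,\partial^2_{rc}\mathcal{H}\rangle - 3\langle\partial_{rc}\mathcal{H}\rangle\langle\partial^2_{rc}\mathcal{H}\rangle \Big), \qquad (17)$$

where $\mathcal{H} = \mathcal{H}(x, Z, A_n)$ and $\partial^2_{rc}\mathcal{H} = x_c^2/\sigma^2 \le 4/\sigma^2$.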

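Proof of Theorem 3. Let $\mathbf{A}_n$ and $\mathbf{B}_n$ be as defined in the theorem. Let $D_n(r, c, s)$ denote the matrix
with entries

$$D_{ij} = \begin{cases} \frac{1}{\sqrt{m}}A_{ij} & \text{if } i < r \text{, or } i = r \text{ and } j < c,\\ s & \text{if } i = r \text{ and } j = c,\\ \frac{1}{\sqrt{m}}B_{ij} & \text{otherwise.}\end{cases}$$

From now onwards we use $\mathcal{H}(x)$ to denote $\mathcal{H}(x, Z, D_n(r, c, s))$ and let $\langle\,\cdot\,\rangle$ denote the corresponding
average, as per Eq. (16). Further, for $r\in[m]$, let $Q_r(x) = (Z_r + \sum_{j=1}^{n} D_{rj}x_j)/(\sqrt{2}\,\sigma)$ and
$\mathcal{H}_{\sim r}(x) = \mathcal{H}(x) - Q_r(x)^2$. Notice that

$$\mathcal{H}(x) = \sum_{j\in[m]} Q_j(x)^2, \qquad \mathcal{H}_{\sim r}(x) = \sum_{j\in[m]\setminus r} Q_j(x)^2.$$

Accordingly, we let $\langle\,\cdot\,\rangle_{\sim r}$ denote the average as defined in (16) with the Hamiltonian $\mathcal{H}_{\sim r}(x)$.
The derivative of $\mathcal{H}(x)$ with respect to the $(r,c)$ entry is

$$\partial_{rc}\mathcal{H}(x) = \frac{1}{\sigma^2}\Big(Z_r + \sum_{j=1}^{n} D_{rj}x_j\Big)x_c = \frac{\sqrt{2}}{\sigma}\, Q_r(x)\, x_c.$$

Its fourth moment can then be bounded, using $x_c\in\{0,2\}$, as

$$\langle(\partial_{rc}\mathcal{H})^4\rangle \le \frac{64}{\sigma^4}\,\frac{\sum_{x} Q_r(x)^4\, e^{-\mathcal{H}_{\sim r}(x) - Q_r(x)^2}}{\sum_{x} e^{-\mathcal{H}_{\sim r}(x) - Q_r(x)^2}} = \frac{64}{\sigma^4}\,\frac{\langle Q_r(x)^4\, e^{-Q_r(x)^2}\rangle_{\sim r}}{\langle e^{-Q_r(x)^2}\rangle_{\sim r}}.$$

Since the random variables $e^{-Q_r(x)^2}$ and $Q_r(x)^4$ are negatively correlated, we have

$$\langle e^{-Q_r(x)^2} Q_r(x)^4\rangle_{\sim r} \le \langle e^{-Q_r(x)^2}\rangle_{\sim r}\,\langle Q_r(x)^4\rangle_{\sim r},$$

which implies

$$\mathbb{E}\langle(\partial_{rc}\mathcal{H})^4\rangle \le \frac{64}{\sigma^4}\,\mathbb{E}\langle Q_r(x)^4\rangle_{\sim r}. \qquad (18)$$

Using the inequality $(a+b+c)^4 \le 27(a^4+b^4+c^4)$ and the definition of $\{A_{ij}\}$ and $\{B_{ij}\}$ in Theorem 3,
we get

$$\mathbb{E}\langle(\partial_{rc}\mathcal{H})^4\rangle \le \frac{432}{\sigma^8}\Big\{\mathbb{E}[Z_r^4] + \mathbb{E}\big[\langle(D_{rc}x_c)^4\rangle_{\sim r}\big] + \mathbb{E}\Big[\Big\langle\Big(\sum_{i\in[n]\setminus c} D_{ri}x_i\Big)^4\Big\rangle_{\sim r}\Big]\Big\}$$
$$\le K_1 + K_1 s^4 + K_1\,\mathbb{E}\Big[\Big\langle\Big(\sum_{i\in[n]\setminus c} D_{ri}x_i\Big)^4\Big\rangle_{\sim r}\Big],$$

where $K_1 = K_1(\sigma)$ is a constant independent of $m$, $n$. If we use the subscript $i\ne j\ne k\ne\dots$ to
denote all the tuples of distinct indices and we expand the power, we get

$$\mathbb{E}\Big[\Big\langle\Big(\sum_{i\in[n]\setminus c} D_{ri}x_i\Big)^4\Big\rangle_{\sim r}\Big] = \sum_{i,j,k,l\in[n]\setminus c}\mathbb{E}\big[D_{ri}D_{rj}D_{rk}D_{rl}\,\langle x_ix_jx_kx_l\rangle_{\sim r}\big]$$
$$= \sum_{i,j,k,l\in[n]\setminus c}\mathbb{E}[D_{ri}D_{rj}D_{rk}D_{rl}]\;\mathbb{E}[\langle x_ix_jx_kx_l\rangle_{\sim r}]$$
$$= \sum_{i\in[n]\setminus c}\mathbb{E}[D_{ri}^4]\,\mathbb{E}[\langle x_i^4\rangle_{\sim r}] + 3\sum_{i\ne j\in[n]\setminus c}\mathbb{E}[D_{ri}^2]\,\mathbb{E}[D_{rj}^2]\,\mathbb{E}[\langle x_i^2x_j^2\rangle_{\sim r}].$$

Here we used the fact that $\{D_{ri}\}_{1\le i\le n}$ are independent of $\mathcal{H}_{\sim r}(x)$, and therefore of $\langle x_ix_jx_kx_l\rangle_{\sim r}$.
Further, all the terms with one of the indices $i,j,k,l$ distinct from all the others vanish because
$\mathbb{E}\{D_{ri}\} = 0$ for all $i\ne c$ by our assumption on $\mathbf{A}_n$, $\mathbf{B}_n$. Using $x_i\in\{0,2\}$, we then get

$$\mathbb{E}\Big[\Big\langle\Big(\sum_{i\in[n]\setminus c} D_{ri}x_i\Big)^4\Big\rangle_{\sim r}\Big] \le \sum_{i\in[n]\setminus c}\mathbb{E}[D_{ri}^4]\cdot 16 + 3\sum_{i\ne j\in[n]\setminus c}\mathbb{E}[D_{ri}^2]\,\mathbb{E}[D_{rj}^2]\cdot 16 \le K_2,$$

where $K_2 = K_2(\alpha)$ is another constant. Putting everything together, we get

$$\mathbb{E}\langle(\partial_{rc}\mathcal{H})^4\rangle \le K_3(1 + s^4),$$

and therefore, by Jensen's inequality, we get $\mathbb{E}\langle|\partial_{rc}\mathcal{H}|^3\rangle \le K_3(1+|s|^3)$ (after possibly enlarging the
constant $K_3$). Using Eq. (17), this finally implies that

$$\mathbb{E}\big[|\partial^3_{rc} f(D_n(r,c,s), Z)|\big] \le \frac{K'}{n}\,(1 + |s|^3).$$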
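We are now in position to apply Theorem 2. Since the means and variances of the entries of $\mathbf{A}_n$
and $\mathbf{B}_n$ are equal, we have $a_i = b_i = 0$. We therefore get

$$\big|\mathbb{E}[f(m^{-1/2}\mathbf{A}_n, Z)] - \mathbb{E}[f(m^{-1/2}\mathbf{B}_n, Z)]\big| \le \frac{1}{2}\sum_{r=1}^{m}\sum_{c=1}^{n}\Big(\mathbb{E}\Big|\int_0^{A_{rc}/\sqrt{m}}|\partial^3_{rc} f(D_n(r,c,s), Z)|\Big(\frac{A_{rc}}{\sqrt{m}} - s\Big)^2 ds\Big| + \mathbb{E}\Big|\int_0^{B_{rc}/\sqrt{m}}|\partial^3_{rc} f(D_n(r,c,s), Z)|\Big(\frac{B_{rc}}{\sqrt{m}} - s\Big)^2 ds\Big|\Big)$$
$$\le m K'\Big\{\mathbb{E}\int_0^{|A_{rc}|/\sqrt{m}}(1+|s|^3)\Big(\frac{|A_{rc}|}{\sqrt{m}} - s\Big)^2 ds + \mathbb{E}\int_0^{|B_{rc}|/\sqrt{m}}(1+|s|^3)\Big(\frac{|B_{rc}|}{\sqrt{m}} - s\Big)^2 ds\Big\}.$$

Each integral is a polynomial in $|A_{rc}|/\sqrt{m}$ (respectively $|B_{rc}|/\sqrt{m}$) of degree at most six with no
terms of degree below three. By the moment bounds of Definition 1 the right-hand side is therefore
$O(m^{-1/2})$, and it vanishes in the large system limit. □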
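The role of the sixth-moment condition in Definition 1 can be read off from the remainder integrals of Theorem 2. As a worked step (a sketch under the bound $\mathbb{E}|\partial^3_{rc} f| \le (K'/n)(1+|s|^3)$; the explicit constants below are computed here and are not quoted from the source), for $u \ge 0$:

```latex
\int_0^{u} (1+s^3)\,(u-s)^2\,ds
  \;=\; \frac{u^3}{3} + u^6\,\frac{\Gamma(4)\,\Gamma(3)}{\Gamma(7)}
  \;=\; \frac{u^3}{3} + \frac{u^6}{60},
\qquad\text{hence}\qquad
\mathbb{E}\int_0^{|A_{rc}|/\sqrt{m}} (1+s^3)\Big(\frac{|A_{rc}|}{\sqrt{m}}-s\Big)^{2} ds
  \;=\; \frac{\mathbb{E}|A_{rc}|^{3}}{3\,m^{3/2}} + \frac{\mathbb{E}|A_{rc}|^{6}}{60\,m^{3}} .
```

Only the third and sixth absolute moments appear. Multiplying by the $mn$ terms of the double sum and by the $1/n$ prefactor of the derivative bound leaves a quantity of order $m^{-1/2}$.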



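The proof of Theorem 4 is very similar to the one above. We only stress the differences below.

Proof of Theorem 4. Let $\mathbf{A}_n^{\gamma}$ and $\mathbf{B}_n$ be as defined in the statement. We modify the definition of
$D_n(r, c, s)$ used in the last proof, as follows:

$$D^{\gamma}_{ij} = \begin{cases} \frac{1}{\sqrt{\gamma}}A^{\gamma}_{ij} & \text{if } i < r \text{, or } i = r \text{ and } j < c,\\ s & \text{if } i = r \text{ and } j = c,\\ \frac{1}{\sqrt{m}}B_{ij} & \text{otherwise.}\end{cases}$$

Now following the proof of Theorem 3 and assuming without loss of generality $\gamma > 1$, we get again

$$\mathbb{E}\langle(\partial_{rc}\mathcal{H})^4\rangle \le K_3(1 + s^4).$$

(The final step consists, as in the previous proof, in bounding the sums $\sum_{i\in[n]\setminus c}\mathbb{E}[(D^{\gamma}_{ri})^4\langle x_i^4\rangle_{\sim r}]$ and
$\sum_{i\ne j\in[n]\setminus c}\mathbb{E}[(D^{\gamma}_{ri})^2(D^{\gamma}_{rj})^2\langle x_i^2x_j^2\rangle_{\sim r}]$.) This in turn implies $\mathbb{E}[|\partial^3_{rc} f(D^{\gamma}_n(r,c,s), Z)|] \le (K'/n)(1+|s|^3)$.
Since the means and variances of the entries of $\gamma^{-1/2}\mathbf{A}_n^{\gamma}$ and $m^{-1/2}\mathbf{B}_n$ are equal, we have $a_i = b_i = 0$. Applying
Theorem 2, we get

$$\big|\mathbb{E}[f(\gamma^{-1/2}\mathbf{A}_n^{\gamma}, Z)] - \mathbb{E}[f(m^{-1/2}\mathbf{B}_n, Z)]\big| \le m K'\Big\{\mathbb{E}\int_0^{|A^{\gamma}_{rc}|/\sqrt{\gamma}}(1+|s|^3)\Big(\frac{|A^{\gamma}_{rc}|}{\sqrt{\gamma}} - s\Big)^2 ds + \mathbb{E}\int_0^{|B_{rc}|/\sqrt{m}}(1+|s|^3)\Big(\frac{|B_{rc}|}{\sqrt{m}} - s\Big)^2 ds\Big\}.$$

Since $A^{\gamma}_{rc}$ is non-zero with probability $\gamma/n$, the first expectation is of order $1/(n\sqrt{\gamma})$ and the
second of order $m^{-3/2}$. Now taking the limit $n\to\infty$ first and then the limit $\gamma\to\infty$ gives the result. □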



4.3 LASSO 

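The proof of Theorem 5 repeats some arguments already present in the proof of Theorem 3 presented
in the previous section. We shall omit such repetitions and instead focus on the new ideas required.

Proof of Theorem 5. Without loss of generality, we will assume $x_{\max} = 1$. Define $\mathcal{X} = [-1, 1]$ and,
for $\delta > 0$, define $\mathcal{X}_\delta = \{k\delta : k\in\mathbb{Z},\ |k\delta| \le 1\}$. In words, $\mathcal{X}_\delta$ is a grid of points in the interval $[-1,1]$
with spacing $\delta$. Recall that $x_0$ is a fixed deterministic signal with $\|x_0\|_\infty \le 1$, and the resulting
measurements read $Y = A_n x_0 + Z$, where $Z$ is a noise vector with i.i.d. Gaussian components. Define
the Hamiltonian function $\mathcal{H}:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^{m\times n}\to\mathbb{R}$ by letting

$$\mathcal{H}(x, z, A_n) = \lambda\|x\|_1 + \frac{1}{2}\|y - A_n x\|_2^2, \qquad y = A_n x_0 + z.$$

With this definition, $L(A_n) = \frac{1}{n}\min_{x\in\mathcal{X}^n}\{\mathcal{H}(x, z, A_n)\}$. Let $L_\delta(A_n) = \frac{1}{n}\min_{x\in\mathcal{X}_\delta^n}\{\mathcal{H}(x, z, A_n)\}$. Our
proof follows by first showing that there exists a constant $C$ such that

$$\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L(n^{-1/2}\mathbf{A}_n)]\big| \le C\delta, \qquad \big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{B}_n)] - \mathbb{E}[L(n^{-1/2}\mathbf{B}_n)]\big| \le C\delta.$$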

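Obviously $L_\delta(A_n) \ge L(A_n)$. In order to prove the converse bound, let $x$ be a minimizer of $\mathcal{H}(x, z, A_n)$
in $\mathcal{X}^n$, and denote by $x_\delta$ its closest approximation in $\mathcal{X}_\delta^n$. Obviously $|x_{\delta,i} - x_i| \le \delta$ for all $i\in[n]$.
We then have

$$\frac{1}{n}\big|\mathcal{H}(x, z, n^{-1/2}A_n) - \mathcal{H}(x_\delta, z, n^{-1/2}A_n)\big| \le \lambda\delta + \frac{1}{2n}\Big|\,\|y - n^{-1/2}A_n x\|_2^2 - \|y - n^{-1/2}A_n x_\delta\|_2^2\,\Big|$$
$$= \lambda\delta + \frac{1}{2n}\Big|\Big(\frac{1}{\sqrt{n}}A_n(x_\delta - x)\Big)^{T} 2z + \Big(\frac{1}{\sqrt{n}}A_n(x - x_\delta)\Big)^{T}\frac{1}{\sqrt{n}}A_n(x + x_\delta - 2x_0)\Big|$$
$$\le \lambda\delta + \frac{1}{2n}\Big\|\frac{1}{\sqrt{n}}A_n(x - x_\delta)\Big\|_2\, 2\|z\|_2 + \frac{1}{2n}\,\sigma_{\max}(n^{-1/2}A_n)^2\,\|x - x_\delta\|_2\,\|x + x_\delta - 2x_0\|_2$$
$$\le \lambda\delta + \sigma_{\max}(n^{-1/2}A_n)\,\delta\,\frac{\|z\|_2}{\sqrt{n}} + 2\,\sigma_{\max}(n^{-1/2}A_n)^2\,\delta, \qquad (19)$$

where we used $\|x_\delta\|_2, \|x\|_2, \|x_0\|_2 \le \sqrt{n}$ and $\|x - x_\delta\|_2 \le \delta\sqrt{n}$. Here $\sigma_{\max}(A_n)$ is the largest singular
value of $A_n$. From [22] we know that $\mathbb{E}[\sigma_{\max}(n^{-1/2}\mathbf{A}_n)^2] \le K$, for some constant $K < \infty$. Combining
this with Eq. (19), and using the Cauchy-Schwarz inequality, we get

$$\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L(n^{-1/2}\mathbf{A}_n)]\big| \le C\delta.$$

A similar result obviously holds for the matrix ensemble $\mathbf{B}_n$ as well. By the triangle inequality, we
have

$$\lim_{n\to\infty}\big|\mathbb{E}[L(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L(n^{-1/2}\mathbf{B}_n)]\big| \le 2C\delta + \lim_{n\to\infty}\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L_\delta(n^{-1/2}\mathbf{B}_n)]\big|.$$

Since this inequality holds for any $\delta > 0$, the proof of the theorem reduces to showing that
$\lim_{n\to\infty}\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L_\delta(n^{-1/2}\mathbf{B}_n)]\big| = 0$.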
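In order to prove this, define

$$f(\delta, \beta, z, A_n) = -\frac{1}{\beta n}\log\Big\{\sum_{x\in\mathcal{X}_\delta^n} e^{-\beta\mathcal{H}(x, z, A_n)}\Big\}.$$

It is easy to see that

$$\lim_{\beta\to\infty} f(\delta, \beta, z, A_n) = \frac{1}{n}\min_{x\in\mathcal{X}_\delta^n}\mathcal{H}(x, z, A_n). \qquad (20)$$

Further, a straightforward calculation shows that

$$\beta^2\,\frac{\partial f}{\partial\beta}(\delta, \beta, z, A_n) = \frac{1}{n}\,H(p_{\beta,A_n}),$$

where $H(p)$ denotes Shannon's entropy of the probability distribution $p$ and $p_{\beta,A_n}(x)\propto\exp\{-\beta\mathcal{H}(x, z, A_n)\}$.
Of course $0 \le H(p_{\beta,A_n}) \le n\log|\mathcal{X}_\delta|$, whence

$$0 \le \frac{\partial f}{\partial\beta}(\delta, \beta, z, A_n) \le \frac{1}{\beta^2}\log|\mathcal{X}_\delta|. \qquad (21)$$

Therefore,

$$\lim_{n\to\infty}\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L_\delta(n^{-1/2}\mathbf{B}_n)]\big|$$
$$= \lim_{n\to\infty}\Big|\lim_{\beta\to\infty}\mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{A}_n)] - \lim_{\beta\to\infty}\mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{B}_n)]\Big|$$
$$\le \lim_{n\to\infty}\big|\mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{B}_n)]\big| + 2\int_\beta^\infty\frac{1}{s^2}\log|\mathcal{X}_\delta|\,ds, \qquad (22)$$

where the first step follows from (20) and the second from (21). Notice the close resemblance between
the function $f(\delta, \beta, Z, A_n)$ defined here and the one used in the previous section. Using the same
arguments developed there for the proof of Theorem 3 it is immediate to show that

$$\big|\mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[f(\delta, \beta, Z, n^{-1/2}\mathbf{B}_n)]\big| \le O\Big(\frac{1}{\sqrt{n}}\Big).$$

Combining this with Eq. (22), we get

$$\lim_{n\to\infty}\big|\mathbb{E}[L_\delta(n^{-1/2}\mathbf{A}_n)] - \mathbb{E}[L_\delta(n^{-1/2}\mathbf{B}_n)]\big| \le \frac{2}{\beta}\log|\mathcal{X}_\delta|.$$

The proof is completed by letting $\beta\to\infty$. □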
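The finite-temperature smoothing of a minimum used in this proof can be illustrated on a short list of energies. The sketch below is an illustration only (the function name is an assumption): it evaluates $-\frac{1}{\beta}\log\sum_x e^{-\beta \mathcal{H}(x)}$ and shows that it increases to $\min_x \mathcal{H}(x)$ with a gap of at most $\log N/\beta$, where $N$ is the number of configurations.

```python
import math


def soft_min(values, beta):
    """-(1/beta) * log(sum_x exp(-beta * H(x))), computed stably by factoring out min H."""
    lowest = min(values)
    partition = sum(math.exp(-beta * (value - lowest)) for value in values)
    return lowest - math.log(partition) / beta


if __name__ == "__main__":
    energies = [0.3, 1.0, 2.5, 0.31, 0.9]
    for beta in (0.5, 1.0, 10.0, 100.0, 1000.0):
        gap = min(energies) - soft_min(energies, beta)
        print(f"beta={beta:7.1f}  gap={gap:.6f}  log(N)/beta={math.log(len(energies)) / beta:.6f}")
```

With $N = |\mathcal{X}_\delta|^n$ configurations and the $1/n$ normalization, the gap bound becomes $\log|\mathcal{X}_\delta|/\beta$, uniformly in $n$, which is why the limits in $n$ and $\beta$ can be handled separately.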
4.4 Wishart Matrices

The proof is analogous to the proof of universality of Wigner's semi-circle law developed in [23].



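Proof of Theorem 6. By the analyticity of the Stieltjes transform, it is sufficient to prove the claim
for $\mathrm{Im}(z)$ large enough.

For an $m\times n$ matrix $A_n$ and any $z\in\mathbb{C}\setminus\mathbb{R}$, let

$$f(A_n) = \frac{1}{n}\mathrm{Tr}\big((A_n^{T}A_n + zI_n)^{-1}\big).$$

In order to simplify the notation we drop the subscript $n$ and denote the partial derivative with
respect to $A_{ij}$ by $\partial_{ij}$. Define $R = (A^{T}A + zI)^{-1}$. Therefore $(A^{T}A + zI)R = I$, which implies
$\partial_{ij}\big((A^{T}A + zI)R\big) = 0$. This yields

$$\partial_{ij}R = -R\,\partial_{ij}(A^{T}A)\,R.$$

Let $\mathbb{1}_{ij}$ denote the matrix with $(i,j)$-th entry equal to 1 and the remaining entries equal to 0. Then

$$\partial_{ij}(A^{T}A) = \mathbb{1}_{ij}^{T}A + A^{T}\mathbb{1}_{ij},$$
$$\partial^2_{ij}(A^{T}A) = 2\,\mathbb{1}_{ij}^{T}\mathbb{1}_{ij},$$
$$\partial^3_{ij}(A^{T}A) = 0.$$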
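The resolvent derivative identity $\partial_{ij}R = -R\,\partial_{ij}(A^{T}A)\,R$ can be checked against a finite difference. The sketch below restricts to matrices with two columns so that the resolvent has a closed-form $2\times 2$ inverse; this restriction and all function names are assumptions made for illustration.

```python
def transpose(p):
    return [list(column) for column in zip(*p)]


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]


def resolvent(a, z):
    """R = (A^T A + z I)^{-1} for an m x 2 real matrix A (closed-form 2 x 2 inverse)."""
    gram = matmul(transpose(a), a)
    p, q = gram[0][0] + z, gram[0][1]
    r, s = gram[1][0], gram[1][1] + z
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def gram_derivative(a, i, j):
    """d(A^T A)/dA_ij = 1_ij^T A + A^T 1_ij, where 1_ij has a single 1 in position (i, j)."""
    unit = [[1.0 if (row, col) == (i, j) else 0.0 for col in range(len(a[0]))]
            for row in range(len(a))]
    left = matmul(transpose(unit), a)
    right = matmul(transpose(a), unit)
    return [[left[u][v] + right[u][v] for v in range(2)] for u in range(2)]


def resolvent_derivative(a, z, i, j):
    """Analytic derivative dR/dA_ij = -R d(A^T A) R."""
    r = resolvent(a, z)
    product = matmul(matmul(r, gram_derivative(a, i, j)), r)
    return [[-entry for entry in row] for row in product]


if __name__ == "__main__":
    a = [[1.0, 2.0], [0.5, -1.0], [0.3, 0.7]]
    print(resolvent_derivative(a, 0.5 + 2.0j, 1, 0))
```

The same product-rule argument, applied twice more, produces the second and third derivatives of the trace used in the remainder of the proof.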

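Using the identity $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, we get

$$\partial_{ij} f = -\frac{1}{n}\mathrm{Tr}\big(\partial_{ij}(A^{T}A)\,R^2\big),$$
$$\partial^2_{ij} f = \frac{2}{n}\mathrm{Tr}\big(\partial_{ij}(A^{T}A)\,R\,\partial_{ij}(A^{T}A)\,R^2\big) - \frac{1}{n}\mathrm{Tr}\big(\partial^2_{ij}(A^{T}A)\,R^2\big),$$
$$\partial^3_{ij} f = -\frac{6}{n}\mathrm{Tr}\big(\partial_{ij}(A^{T}A)\,R\,\partial_{ij}(A^{T}A)\,R\,\partial_{ij}(A^{T}A)\,R^2\big) + \frac{3}{n}\mathrm{Tr}\big(\partial^2_{ij}(A^{T}A)\,R\,\partial_{ij}(A^{T}A)\,R^2\big)$$
$$\qquad + \frac{3}{n}\mathrm{Tr}\big(\partial_{ij}(A^{T}A)\,R\,\partial^2_{ij}(A^{T}A)\,R^2\big). \qquad (23)$$

Note that $R$ is a symmetric matrix and therefore is diagonalizable. Moreover, note that the singular
values of $R^k$ are bounded by $|v|^{-k}$, where $v = \mathrm{Im}(z)$. Let $\|A\|$ and $\|A\|_2$ denote the Frobenius norm
and the spectral norm of $A$ respectively. From the Cauchy-Schwarz inequality we have $|\mathrm{Tr}(AB)| \le \|A\|\,\|B\|$.
Therefore, we can bound the first term as

$$\big|\mathrm{Tr}\big(\partial_{ij}(A^{T}A)R\,\partial_{ij}(A^{T}A)R\,\partial_{ij}(A^{T}A)R^2\big)\big| \le \big\|(\partial_{ij}(A^{T}A)R)^2\big\|\,\big\|\partial_{ij}(A^{T}A)R^2\big\|$$
$$\overset{(a)}{\le} \frac{1}{|v|^2}\,\|\partial_{ij}(A^{T}A)\|\,\|\partial_{ij}(A^{T}A)R\|^2$$
$$\overset{(b)}{\le} \frac{1}{|v|^4}\,\|\partial_{ij}(A^{T}A)\|^3, \qquad (24)$$

where we have used $\|AB\| \le \|A\|\,\|B\|$ in (a) and $\|AB\| \le \|A\|\,\|B\|_2$ in both (a) and (b).

Similarly one can bound the second and third terms of (23) as

$$\big|\mathrm{Tr}\big(\partial^2_{ij}(A^{T}A)R\,\partial_{ij}(A^{T}A)R^2\big)\big| \le \frac{1}{|v|^3}\,\|\partial^2_{ij}(A^{T}A)\|\,\|\partial_{ij}(A^{T}A)\| = \frac{2}{|v|^3}\,\|\partial_{ij}(A^{T}A)\|. \qquad (25)$$

Finally, we can bound $\|\partial_{ij}(A^{T}A)\|$ as follows

$$\|\partial_{ij}(A^{T}A)\| \le \|\mathbb{1}_{ij}^{T}A\| + \|A^{T}\mathbb{1}_{ij}\| = 2\|A^{T}\mathbb{1}_{ij}\| = 2\Big(\sum_{k=1}^{n}A_{ik}^2\Big)^{1/2}. \qquad (26)$$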



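Let us now consider the random matrices $\mathbf{A}_n^{\gamma}$ and $\mathbf{B}_n$, as defined in the theorem. Let $C_n(r, c, s)$
denote the matrix constructed as in Section 4.2, i.e.,

$$C_{ij} = \begin{cases} \frac{1}{\sqrt{\gamma}}A^{\gamma}_{ij} & \text{if } i < r \text{, or } i = r \text{ and } j < c,\\ s & \text{if } i = r \text{ and } j = c,\\ \frac{1}{\sqrt{n}}B_{ij} & \text{otherwise.}\end{cases}$$

Using the equations (23), (24), (25), and (26), we get

$$\mathbb{E}\big\{|\partial^3_{rc} f(C_n(r, c, s))|\big\} \le \frac{K_v}{n}\,\mathbb{E}\Big[\Big(1 + \sum_{k=1}^{n}C_{rk}^2\Big)^{3/2}\Big],$$

where $K_v$ is a constant depending only on $v$. The proof is finished as for Theorem 4. □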



4.5 Proof of Corollary 2

Throughout this proof we will assume $\sigma = 1$, for simplicity of notation (general $\sigma > 0$ follows exactly
the same argument).

Convergence of the Stieltjes transform implies weak convergence of the expected distribution of
eigenvalues [16, Theorem 2.4.4]. This means that for any continuous bounded function $f$,$^{1}$

$$\lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f\big(\lambda_i(\gamma^{-1}(\mathbf{A}_n^{\gamma})^{T}\mathbf{A}_n^{\gamma})\big)\big] = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f\big(\lambda_i(n^{-1}\mathbf{B}_n^{T}\mathbf{B}_n)\big)\big]. \qquad (27)$$

The limit on the right hand side exists because the expected distribution of eigenvalues of Wishart
matrices converges. Moreover, the limiting distribution function is continuous. Therefore, the
convergence of the distributions implies the convergence of expectations for any bounded measurable
function, not necessarily continuous (by the bounded convergence theorem). We are interested in
establishing a result of the form (27) for the function $f(x) = \log(1+x)$, which is not bounded.
However, note that only the behavior of $f$ in the region $x \ge 0$ is relevant, because $\lambda_i \ge 0$. In the
domain of interest the function $f$ is bounded from below. In order to tackle the issue of boundedness
from above, we use a standard truncation trick. We define $g_M(x) = f(x)\,\mathbb{1}_{\{x\le M\}}$, for some $0 < M < \infty$.
Note that the function $g_M$ is bounded on $\mathbb{R}_+$. Therefore

$$\lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[g_M\big(\lambda_i(\gamma^{-1}(\mathbf{A}_n^{\gamma})^{T}\mathbf{A}_n^{\gamma})\big)\big] = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[g_M\big(\lambda_i(n^{-1}\mathbf{B}_n^{T}\mathbf{B}_n)\big)\big]. \qquad (28)$$

$^{1}$Note that we have two limits on the left hand side. This can be taken care of by noticing that
$\lim_{\gamma\to\infty}\lim_{n\to\infty} f(\gamma, n, x) = f(x)$ is equivalent to saying that $\lim_{n\to\infty} f(\gamma_n, n, x) = f(x)$ along any sequence $\{\gamma_n\}$
satisfying $\lim_{n\to\infty}\gamma_n = \infty$.
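Note that, writing $\lambda_i^{\gamma} \equiv \lambda_i\big(\gamma^{-1}(\mathbf{A}_n^{\gamma})^{T}\mathbf{A}_n^{\gamma}\big)$,

$$\lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big\{|g_M(\lambda_i^{\gamma}) - f(\lambda_i^{\gamma})|\big\} = \lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big\{\log(1+\lambda_i^{\gamma})\,\mathbb{1}_{\{\lambda_i^{\gamma} > M\}}\big\}$$
$$\le \lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[(\lambda_i^{\gamma})^2/M\big]$$
$$= \lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{Mn\gamma^2}\,\mathbb{E}\,\mathrm{Tr}\big[\big((\mathbf{A}_n^{\gamma})^{T}\mathbf{A}_n^{\gamma}\big)^2\big]$$
$$= \lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{Mn\gamma^2}\,\mathbb{E}\sum_{i,j}\Big(\sum_{k}A^{\gamma}_{ki}A^{\gamma}_{kj}\Big)^2$$
$$= \lim_{\gamma\to\infty}\lim_{n\to\infty}\frac{1}{Mn\gamma^2}\Big\{\sum_{i\ne j}\sum_{k}\mathbb{E}\big[(A^{\gamma}_{ki})^2(A^{\gamma}_{kj})^2\big] + \sum_{i}\sum_{k_1\ne k_2}\mathbb{E}\big[(A^{\gamma}_{k_1i})^2(A^{\gamma}_{k_2i})^2\big] + \sum_{i,k}\mathbb{E}\big[(A^{\gamma}_{ki})^4\big]\Big\}$$
$$\le \frac{K}{M}, \qquad (29)$$

for a constant $K$ independent of $M$, $\gamma$ as long as $\gamma > 1$. Using a similar argument we can show that

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big\{\big|g_M\big(\lambda_i(n^{-1}\mathbf{B}_n^{T}\mathbf{B}_n)\big) - f\big(\lambda_i(n^{-1}\mathbf{B}_n^{T}\mathbf{B}_n)\big)\big|\big\} \le \frac{K'}{M}, \qquad (30)$$

for a constant $K'$ independent of $M$. From (28), (29), (30) we get

$$\lim_{\gamma\to\infty}\lim_{n\to\infty}\big|\mathbb{E}[C_n(\gamma^{-1/2}\mathbf{A}_n^{\gamma})] - \mathbb{E}[C_n(n^{-1/2}\mathbf{B}_n)]\big| \le \frac{K + K'}{M}.$$

Now taking the limit $M\to\infty$ gives the desired result.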